The aim of this project….
This project used data from 2930 residential property sales in Ames, Iowa between 2006 and 2010. The data set contains 82 explanatory variables of four types: nominal, ordinal, discrete, and continuous. Continuous variables describe area dimensions of the house and property, such as the size of the lot and garage. Discrete variables count features of the house and property, such as the number of kitchens, baths, bedrooms, and parking spots. Nominal variables generally describe types of materials and locations, such as the name of the neighborhood or the type of foundation. Ordinal variables typically rate the condition and quality of house characteristics and utilities.
Prior to the exploratory data analysis, we hypothesize that the following variables will be the most predictive of home price: lot area, home type, year built, and overall quality. We expect these to be the most predictive because they are among the top criteria we would consider if we were in the market for a home.
Furthermore, we hypothesize that a generalized additive model (GAM) will be the best model to use, because a GAM can combine the strengths of several model types, including polynomials, cubic splines, and smoothing splines, by fitting a separate term for each predictor.
Sale Price graph
Lot area has many outliers, as shown above. We found 127 outliers above the minimum outlier value of 17755 square feet. Because these made visualization difficult, we temporarily removed them. With the outliers removed, lot area is roughly normally distributed around the median of 9436.5 square feet.
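The report does not say how the outlier cutoff was derived. One common convention is Tukey's rule, which flags values above the third quartile plus 1.5 times the interquartile range. The sketch below applies that rule to simulated lot areas; the `lot_area` vector and its distribution are assumptions made for illustration, not the report's data.

```r
# Simulated, right-skewed lot areas (hypothetical stand-in for the real column)
set.seed(1)
lot_area <- round(rlnorm(2930, meanlog = log(9400), sdlog = 0.5))

# Tukey's rule: flag values above Q3 + 1.5 * IQR
q <- quantile(lot_area, c(0.25, 0.75), names = FALSE)
upper_fence <- q[2] + 1.5 * (q[2] - q[1])
is_outlier <- lot_area > upper_fence

# Keep the full vector for modeling; drop the outliers only for plotting
lot_area_plot <- lot_area[!is_outlier]
cat("upper fence:", upper_fence, "\n")
cat("outliers flagged:", sum(is_outlier), "\n")
summary(lot_area_plot)
```

Keeping `is_outlier` as a separate logical vector makes the removal temporary: the original values stay available for the modeling steps.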
From Figure 3, we see that 1-story homes built in 1946 or later make up a large share of our dataset, specifically 1079 homes. This is over one-third of the dataset, which has 2930 observations. Please note that the graphs are interactive, so move your cursor over a graph to see more details.
Furthermore, Figure 4 shows that most homes were built within about five years of 2005.
Summary Statistics
```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   3.000   3.000   3.511   4.000   5.000
```
We can observe from Figure 5 that sale price varies widely across neighborhoods. We also see variation within each neighborhood. Investigating some housing characteristics may give us insight into the price variation observed within neighborhoods.
We first examined overall quality (Figure 6) and, as expected, price increases as overall quality increases. Examining year built (Figure 7), we observe that the newer a home is, the higher its price, on average.
Figure 9 explores the relationship between kitchen quality and sale price. The higher the kitchen quality, the higher the median sale price. This increase, however, is nonlinear and appears roughly quadratic. From Figure 10, we can see that, as expected, there is a gradual positive relationship between lot area and sale price.
Sale Price:
Missing data:
We removed observations with missing values from the final data set used for variable selection and modeling.
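A minimal sketch of this kind of listwise deletion in base R is shown below. The `homes` data frame and its values are made up for illustration, and the report's own code may have used a different function.

```r
# Toy data frame with missing values (hypothetical columns and values)
homes <- data.frame(
  saleprice = c(104000, 215000, NA, 172000),
  lot_area  = c(4853, 10000, 8400, NA),
  fence     = c("MnPrv", NA, "GdWo", "MnPrv")
)

# Count missing values per column before deciding what to drop
colSums(is.na(homes))

# Keep only the rows with no missing value in any column
homes_complete <- homes[complete.cases(homes), ]
nrow(homes_complete)
```

Listwise deletion drops a row if any selected column is missing, so checking the per-column counts first shows which variables are responsible for most of the lost rows.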
Modifying variable class:
We decided to keep the selected quality variables continuous rather than converting them to factors. Converting them would have led us to drop the “Very Poor” (“1”) factor level, as this level has only around 4 observations. By keeping the variables continuous, we retain these observations and can better predict the prices of homes in this category.
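One way the rare-level problem can show up is sketched below, under the assumption that the rare level ends up missing from the training split. The toy `train` and `new_home` data frames are invented for illustration. With numeric coding the fitted trend extends to the unseen value, while with factor coding `predict()` stops because the level has no coefficient.

```r
# Toy data: quality level 1 is rare and appears only in the new home (made-up values)
train <- data.frame(qual  = c(2, 3, 3, 4, 4, 5, 5, 2),
                    price = c(90, 120, 125, 160, 150, 200, 195, 95))
new_home <- data.frame(qual = 1)

# Numeric coding: the linear trend extrapolates to the unseen level
fit_num <- lm(price ~ qual, data = train)
pred_num <- predict(fit_num, newdata = new_home)
pred_num

# Factor coding: level "1" was never seen in training, so predict() raises an error
fit_fac <- lm(price ~ factor(qual), data = train)
fac_result <- tryCatch(predict(fit_fac, newdata = new_home),
                       error = function(e) "error: new factor level")
fac_result
```

The trade-off is that numeric coding assumes the quality scores are evenly spaced, which the spline and polynomial terms used later can partly relax.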
Model Selection:
We began our model selection by reducing the number of variables in our housing data set. We created a subset that included the variables we hypothesized would be important predictors of sale price.
These variables include:
- LotArea: Lot size in square feet
- OverallQual: Rates the overall material and finish of the house
- YearBuilt: Original construction date
- Exterior1st: Exterior covering on house
- HeatingQC: Heating quality and condition
- Foundation: Type of foundation
- TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
- KitchenQual: Kitchen quality
- BsmtQual: Evaluates the height of the basement
- Neighborhood: Physical locations within Ames city limits
- LandSlope: Slope of property
- Street: Type of road access to property
- HouseStyle: Style of dwelling
- GarageQual: Garage quality
- Fence: Fence quality
- YrSold: Year sold (YYYY)

We also included additional variables that are used later in the report to build a renovation calculator:

- FullBath: Full bathrooms above grade
- RoofStyle: Type of roof

Using our subset, we ran (1) subset selection, (2) forward stepwise selection, and (3) backward stepwise selection. The graphs below plot the number of variables against the BIC value for each of the three methods.
Across all three variable selection methods, the model with 7 variables has the lowest BIC score. Comparing the seven-variable models across the three selection methods, we see that they all share the same variables.
| Selection method (1) | Selection method (2) | Selection method (3) |
|---|---|---|
| (Intercept) | (Intercept) | (Intercept) |
| tot_rms_abv_grd | tot_rms_abv_grd | tot_rms_abv_grd |
| overall_qual | overall_qual | overall_qual |
| lot_area | lot_area | lot_area |
| Bsmt.Qual | Bsmt.Qual | Bsmt.Qual |
| Kitchen.Qual | Kitchen.Qual | Kitchen.Qual |
| NeighborhoodNorthridge | NeighborhoodNorthridge | NeighborhoodNorthridge |
| BsmtFin.Type.1Unf | BsmtFin.Type.1Unf | BsmtFin.Type.1Unf |
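As a rough illustration of BIC-based stepwise selection, the sketch below runs a forward search with base R's `step()`, setting `k = log(n)` so that the penalty matches BIC. The data are simulated and the predictor names `x1` to `x4` are placeholders. The report does not show which function it used, so this is only one plausible way to set up such a search.

```r
set.seed(42)
n <- 300
# Simulated predictors; only x1 and x2 truly affect y (illustrative data)
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 3 * dat$x1 - 2 * dat$x2 + rnorm(n)

# Start from the intercept-only model
null_fit <- lm(y ~ 1, data = dat)

# Forward stepwise search; k = log(n) makes step() penalize like BIC
fwd <- step(null_fit,
            scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3 + x4),
            direction = "forward", k = log(n), trace = 0)

names(coef(fwd))  # terms kept by the search
BIC(fwd)          # BIC of the selected model
```

Changing `direction` to `"backward"` and starting from the full model gives the backward variant, which allows the kind of cross-method comparison described above.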
After variable selection, we ran cross-validation to produce 10-fold CV error estimates for polynomial regression, cubic splines, and smoothing splines.
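The comparison can be sketched as follows for a single predictor. This is a minimal base R example on simulated data, with arbitrary choices (a quadratic polynomial and 5 degrees of freedom for both splines) that are assumptions, not the settings used in the report.

```r
library(splines)  # ships with R; provides bs() for cubic regression splines

set.seed(7)
n <- 400
# Simulated predictor and response with a curved relationship (illustrative only)
x <- runif(n, 0, 10)
y <- 50 + 8 * x - 0.6 * x^2 + rnorm(n, sd = 3)
dat <- data.frame(x = x, y = y)

k <- 10
fold <- sample(rep(1:k, length.out = n))  # random fold label for each row

# Average held-out mean squared error across the k folds
cv_mse <- function(fit_predict) {
  mean(sapply(1:k, function(i) {
    train <- dat[fold != i, ]
    test  <- dat[fold == i, ]
    mean((test$y - fit_predict(train, test))^2)
  }))
}

cv_poly <- cv_mse(function(train, test)
  predict(lm(y ~ poly(x, 2), data = train), newdata = test))

cv_bs <- cv_mse(function(train, test) {
  # Fix boundary knots to the full range so held-out points stay inside it
  fit <- lm(y ~ bs(x, df = 5, Boundary.knots = c(0, 10)), data = train)
  predict(fit, newdata = test)
})

cv_ss <- cv_mse(function(train, test) {
  fit <- smooth.spline(train$x, train$y, df = 5)
  predict(fit, test$x)$y
})

round(c(polynomial = cv_poly, cubic_spline = cv_bs, smoothing_spline = cv_ss), 2)
```

Repeating this over a grid of degrees or degrees of freedom gives one CV error curve per model type, which is the kind of curve the per-variable comparisons rely on.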
A smoothing spline with 2 degrees of freedom appears to be the best model choice for lot area. It has the lowest CV error and the most stable curve.
A smoothing spline with 6 degrees of freedom appears to be the best fit for the total rooms above grade variable. While a cubic spline with fewer degrees of freedom is comparable, the cubic spline becomes more unstable at higher degrees of freedom.
A smoothing spline with 6 degrees of freedom appears to be a good fit for overall quality; however, other models perform comparably.
A quadratic polynomial appears to be the best fit for kitchen quality, as it has the lowest CV error.
A cubic spline with 8 degrees of freedom appears to be the best model for year built. Other models are close in CV error and are fairly stable, but the cubic spline model has the lowest error.
A degree-three polynomial appears to be the best option for basement quality, as it has the lowest CV error. The cubic spline has only one point, so it is unclear whether it has a stable trend.
| model | RMSE | MAE |
|---|---|---|
| linear | 32828.06 | 23224.5 |
| gam | 31338.57 | 21132.8 |
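For reference, the two error metrics in the table can be computed as in the sketch below. The `actual` and `predicted` vectors are made-up numbers used only to exercise the functions; they are not taken from the report.

```r
# Root mean squared error and mean absolute error of a set of predictions
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

# Toy sale prices and predictions (made-up numbers, not from the report)
actual    <- c(104000, 215000, 172000, 189900)
predicted <- c(109500, 205000, 180000, 185000)

data.frame(model = "example",
           RMSE  = rmse(actual, predicted),
           MAE   = mae(actual, predicted))
```

RMSE squares the errors before averaging, so it weights large misses more heavily and is never smaller than MAE on the same predictions.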
```
gam.fit.sum
##
## Call: gam(formula = saleprice ~ s(lot_area, 2) + s(tot_rms_abv_grd,
##     6) + s(overall_qual, 6) + poly(Kitchen.Qual, 2) + bs(year_built,
##     8) + poly(Bsmt.Qual, 3) + Neighborhood + full_bath_abv_grd +
##     Roof.Style + BsmtFin.Type.1, data = training)
## Deviance Residuals:
##       Min        1Q    Median        3Q       Max
## -307855.7  -15110.9    -602.1   13204.9  229251.2
##
## (Dispersion Parameter for gaussian family taken to be 1003401562)
##
##     Null Deviance: 14893558370411 on 2327 degrees of freedom
## Residual Deviance: 2269694308185 on 2262 degrees of freedom
## AIC: 54925.29
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
##                         Df        Sum Sq       Mean Sq  F value
## s(lot_area, 2)           1  867039361341  867039361341  864.100
## s(tot_rms_abv_grd, 6)    1 3083264528360 3083264528360 3072.812
## s(overall_qual, 6)       1 6019444199776 6019444199776 5999.038
## poly(Kitchen.Qual, 2)    2  356665791154  178332895577  177.728
## bs(year_built, 8)        8  240713961304   30089245163   29.987
## poly(Bsmt.Qual, 3)       3  185458199658   61819399886   61.610
## Neighborhood            27  394673946803   14617553585   14.568
## full_bath_abv_grd        1   45638898787   45638898787   45.484
## Roof.Style               5    8518758484    1703751697    1.698
## BsmtFin.Type.1           5  108739787663   21747957533   21.674
## Residuals             2262 2269694308185    1003401562
##                                      Pr(>F)
## s(lot_area, 2)        < 0.00000000000000022 ***
## s(tot_rms_abv_grd, 6) < 0.00000000000000022 ***
## s(overall_qual, 6)    < 0.00000000000000022 ***
## poly(Kitchen.Qual, 2) < 0.00000000000000022 ***
## bs(year_built, 8)     < 0.00000000000000022 ***
## poly(Bsmt.Qual, 3)    < 0.00000000000000022 ***
## Neighborhood          < 0.00000000000000022 ***
## full_bath_abv_grd          0.00000000001947 ***
## Roof.Style                           0.1317
## BsmtFin.Type.1        < 0.00000000000000022 ***
## Residuals
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
##                       Npar Df Npar F                 Pr(F)
## (Intercept)
## s(lot_area, 2)              1 80.448 < 0.00000000000000022 ***
## s(tot_rms_abv_grd, 6)       5 22.795 < 0.00000000000000022 ***
## s(overall_qual, 6)          5 37.108 < 0.00000000000000022 ***
## poly(Kitchen.Qual, 2)
## bs(year_built, 8)
## poly(Bsmt.Qual, 3)
## Neighborhood
## full_bath_abv_grd
## Roof.Style
## BsmtFin.Type.1
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
```
#or
#lm.fit
##   tot_rms_abv_grd overall_qual lot_area year_built Bsmt.Qual Kitchen.Qual
## 1               2            6    10000       2005        90            3
##   Neighborhood full_bath_abv_grd Roof.Style BsmtFin.Type.1
## 1     Somerset                 2        Hip            ALQ
##     full_bath_abv_grd tot_rms_abv_grd            home_type overall_qual
## 254                 2               5 1-STORY 1945 & OLDER            5
##     lot_area year_built Garage.Qual Exterior.1st   Foundation Bsmt.Qual
## 254     4853       1924           3 Metal Siding Brick & Tile        80
##     Heating.QC Roof.Style Kitchen.Qual                       Neighborhood
## 254          2      Gable            2 South & West Iowa State University
##               Fence Street   Land.Slope Yr.Sold saleprice BsmtFin.Type.1
## 254 Minimum Privacy  Paved Gentle Slope    2010    104000            Rec
##      254
## 109580.7
##     full_bath_abv_grd tot_rms_abv_grd            home_type overall_qual
## 254                 3               5 1-STORY 1945 & OLDER            5
##     lot_area year_built Garage.Qual Exterior.1st   Foundation Bsmt.Qual
## 254     4853       1924           3 Metal Siding Brick & Tile        80
##     Heating.QC Roof.Style Kitchen.Qual                       Neighborhood
## 254          2      Gable            2 South & West Iowa State University
##               Fence Street   Land.Slope Yr.Sold saleprice BsmtFin.Type.1
## 254 Minimum Privacy  Paved Gentle Slope    2010    104000            Rec
```
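The output above shows home 254 predicted at 109580.7, followed by the same home with its full bathroom count changed from 2 to 3. A renovation calculator of this kind can be sketched as a function that predicts before and after a change and returns the difference. The example below uses a plain `lm()` on simulated data instead of the report's GAM, and the function name `renovation_effect` is a hypothetical helper, not something defined in the report.

```r
set.seed(3)
n <- 500
# Simulated training data (hypothetical values; column names mirror the report)
training <- data.frame(
  full_bath_abv_grd = sample(1:3, n, replace = TRUE),
  overall_qual      = sample(1:10, n, replace = TRUE),
  lot_area          = round(runif(n, 3000, 15000))
)
training$saleprice <- 20000 + 15000 * training$full_bath_abv_grd +
  18000 * training$overall_qual + 2 * training$lot_area + rnorm(n, sd = 10000)

fit <- lm(saleprice ~ full_bath_abv_grd + overall_qual + lot_area, data = training)

# Predicted change in sale price when selected columns of one home are altered
renovation_effect <- function(model, home, changes) {
  renovated <- home
  renovated[names(changes)] <- changes   # overwrite only the renovated features
  before <- predict(model, newdata = home)
  after  <- predict(model, newdata = renovated)
  unname(after - before)
}

home <- data.frame(full_bath_abv_grd = 2, overall_qual = 5, lot_area = 4853)
effect <- renovation_effect(fit, home, list(full_bath_abv_grd = 3))
effect
```

For a purely linear model the effect of one extra bathroom equals its coefficient. With spline or polynomial terms, as in the GAM, the predicted change would depend on the home's starting values.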